---
layout: post
title: Reactivating spa clients with RFM analysis
subtitle: Using RFM segmentation and clustering on a spa's historical data to identify and re-engage dormant clients
tags: [tutorial,plotly,python,jupyter]
---
The spa industry has witnessed tremendous growth in recent years as individuals seek rejuvenation, relaxation, and wellness amidst the fast-paced and stress-filled modern lifestyle. In this competitive landscape, spa businesses face the constant challenge of attracting and retaining clients. To thrive in this environment, spas must not only provide exceptional services but also employ effective marketing strategies that engage and reactivate their existing client base.
This analysis aims to explore and leverage the power of RFM (Recency, Frequency, Monetary) analysis to understand client behavior and develop targeted approaches to reactivate clients within a spa setting. RFM analysis has proven to be a valuable tool in customer segmentation, allowing businesses to identify and prioritize their most valuable clients based on their purchasing patterns. By examining the recency of client visits, the frequency of their visits, and their monetary contributions, spas can gain insights into client preferences, tailor personalized experiences, and implement strategies to boost customer engagement and loyalty.
The primary aim of this project is to utilize RFM analysis and other clustering techniques to reactivate clients within a spa setting. By understanding client behavior and preferences, the project aims to develop targeted strategies that engage dormant clients and encourage them to return for spa services. The objectives of the project are as follows:
- **Conduct RFM analysis:** Perform an in-depth analysis of client data (recency, frequency, and monetary value) to identify segments of clients based on their purchasing patterns. This analysis helps determine the most valuable clients, as well as those who have lapsed or become inactive.
- **Segment the client base:** Use cluster analysis techniques to further divide the spa's client base into distinct groups with similar characteristics, preferences, and behaviors. This segmentation enables a personalized marketing approach for each group.
- **Identify reactivation opportunities:** Find dormant or inactive clients within the segments and analyze the factors that contributed to their disengagement. This provides insight into the reasons for client attrition and informs the development of reactivation strategies.
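To make the three RFM metrics concrete before diving into the real data, here is a minimal sketch of an RFM computation in pandas. The toy rows are invented; the column names `ID_Cli`, `Data`, and `ValorPago` mirror the treatments table described later in this post.

```python
import pandas as pd

# Hypothetical transaction log: one row per treatment
visits = pd.DataFrame({
    'ID_Cli': [1, 1, 2, 2, 2, 3],
    'Data': pd.to_datetime(['2023-01-05', '2023-04-20', '2023-02-01',
                            '2023-03-01', '2023-05-10', '2022-06-15']),
    'ValorPago': [30.0, 45.0, 20.0, 20.0, 25.0, 60.0],
})

# Reference date: one day after the last recorded visit
snapshot = visits['Data'].max() + pd.Timedelta(days=1)

rfm = visits.groupby('ID_Cli').agg(
    Recency=('Data', lambda d: (snapshot - d.max()).days),  # days since last visit
    Frequency=('Data', 'count'),                            # number of visits
    Monetary=('ValorPago', 'sum'),                          # total spend
).reset_index()
```

A low recency, high frequency, and high monetary value marks a loyal client; a very high recency flags a lapsed one.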
The analysis of the spa's historical data involves working with several dimensional tables and fact tables that provide essential information about the clients, employees, treatment categories, and treatments themselves. Here is an overview of the different tables:
- **Clients table:** Contains information about the spa's clients, such as client ID, name, contact details, demographics, and other relevant attributes. It is the primary source for understanding client behavior and preferences.
- **Employees table:** Contains information about the spa's staff members, namely their ID and name.
- **Categories table:** Describes the different categories of treatments offered by the spa, including category ID, name, description, pricing, and any additional attributes that define the treatment categories.
- **Treatments table:** The fact table that connects clients, employees, and treatment categories. It records each individual treatment provided to a client, including treatment ID, client ID, employee ID, treatment category, date/time of treatment, and the associated cost.
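As a sketch of how the fact table relates to the dimension tables (the toy rows here are invented; the column names mirror the Portuguese names used in the real files), the treatments table can be enriched by left-joining the dimension tables onto it:

```python
import pandas as pd

# Tiny illustrative dimension tables
clients = pd.DataFrame({'ID_Cli': [1, 2], 'Nome': ['Ana', 'Rui']})
products = pd.DataFrame({'ID_Prd': [10, 11], 'Categoria': ['Manicure', 'Cabelo']})

# Tiny illustrative fact table: one row per treatment
treatments = pd.DataFrame({
    'ID_Tra': [100, 101, 102],
    'ID_Cli': [1, 2, 1],
    'ID_Prod': [10, 11, 11],
    'ValorPago': [8.0, 12.0, 12.0],
})

# Left-join the fact table against each dimension table on its key
enriched = (treatments
            .merge(clients, on='ID_Cli', how='left')
            .merge(products, left_on='ID_Prod', right_on='ID_Prd', how='left'))
```

Left joins keep every treatment row even when a key is missing from a dimension table, which matters here because, as shown later, some treatments have null client or product IDs.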
To ensure the privacy and confidentiality of individuals, it is important to emphasize that no personal information will be disclosed or exposed in the analysis. As the data used for this project is real and contains sensitive information, strict measures will be implemented to anonymize and protect personal data. Names, contact details, and any other personally identifiable information will be removed. The focus of the analysis will be on understanding trends, patterns, and behaviors at an aggregate level, without compromising the confidentiality of any specific individuals. By upholding these ethical standards, the analysis can be conducted in a responsible and respectful manner, maintaining the integrity of the data source.
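As a minimal sketch of that anonymization step (the rows and column values here are invented for illustration), the personally identifiable columns can simply be dropped before any analysis is shared:

```python
import pandas as pd

# Invented sample of the clients table
clients = pd.DataFrame({
    'ID_Cli': [1, 2],
    'Nome': ['Ana Silva', 'Rui Costa'],
    'Telemovel': ['912345678', '913456789'],
    'Email': ['ana@example.com', 'rui@example.com'],
    'Saldo': [0, 15],
})

# Personally identifiable fields to strip before publishing results
PII_COLUMNS = ['Nome', 'Telemovel', 'Email']
anonymous = clients.drop(columns=PII_COLUMNS)
```

Only the surrogate key and aggregate-level attributes survive, so trends can still be analyzed per client without exposing who that client is.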
```python
# Import packages
import numpy as np
import pandas as pd
import datetime as dt
from unidecode import unidecode
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sb
import matplotlib.pyplot as plt
from cleaning import *  # project-specific helpers: cleaning_phone, create_gender, create_birthday
%matplotlib inline
from IPython.display import display

pd.set_option('display.max_columns', None)
```
```python
# Load datasets
clientes = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/T_Clientes.xlsx")
empregados = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/T_Empregados.xlsx")
tratamentos = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/T_Tratamentos.xlsx")
produtos = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/T_Produtos.xlsx")

# Lists of Portuguese names, split by gender
woman = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/Nomes.xlsx", sheet_name='Woman')
man = pd.read_excel("C:/Users/casar/Documents/Portafolio/Spa analysis/Data/Nomes.xlsx", sheet_name='Men')
```
```python
# Select the relevant features for the analysis
clientes = clientes[['ID_Cli', 'Nome', 'DataNascimento', 'Localidade', 'Nacionalidade', 'Telefone', 'Telemovel', 'Email', 'Notas', 'Saldo']]

# Fill null values in Telemovel with the values in Telefone
clientes['Telemovel'] = np.where(clientes['Telemovel'].isnull(), clientes['Telefone'], clientes['Telemovel'])

# Drop rows where Telemovel is still null
clientes_clean = clientes.dropna(subset=['Telemovel'])

clientes_valid = cleaning_phone(clientes_clean, 'Telemovel')
clientes_valid = create_gender(clientes_valid, woman, man)
clientes_valid = create_birthday(clientes_valid, 'DataNascimento')
```
### Clients with the same phone number
```python
# Boolean mask of rows whose phone number appears earlier in the dataframe
duplicates = clientes_valid.duplicated(subset='Telemovel_clean')
duplicated_rows = clientes_valid[duplicates]

# List of the duplicated phone numbers
duplicate_phones = list(duplicated_rows['Telemovel_clean'].unique())

# Dataframe with every client that shares a phone number
clientes_duplicate_phones = clientes_valid[clientes_valid['Telemovel_clean'].isin(duplicate_phones)]

# Clients aged 16 or younger (with a positive age)
young_clients = clientes_duplicate_phones[(clientes_duplicate_phones['Idade'] <= 16) & (clientes_duplicate_phones['Idade'] > 0)]

# Clients aged 82 or older
old_clientes = clientes_duplicate_phones[clientes_duplicate_phones['Idade'] >= 82]

# Clients at either age extreme
no_clients_age = pd.concat([young_clients, old_clientes])

# Keep only clients outside the extreme age ranges
clientes_duplicate_phones_age = clientes_duplicate_phones[~clientes_duplicate_phones['ID_Cli'].isin(no_clients_age['ID_Cli'])]

# For each phone number, find the greatest age
max_age = clientes_duplicate_phones_age.groupby('Telemovel_clean')['Idade'].max().reset_index()

# Merge the age-filtered clients with the per-phone maximum age, then drop duplicates
clients_no_dup = clientes_duplicate_phones_age.merge(max_age, on=['Telemovel_clean', 'Idade'], how='right').drop_duplicates(subset='Telemovel_clean')

# The IDs to remove are the ones not kept in clients_no_dup
id_dup = [x for x in clientes_duplicate_phones['ID_Cli'].unique() if x not in clients_no_dup['ID_Cli'].unique()]
```
### Same client with different IDs
```python
# Clients duplicated on phone number, age and gender
duplicates = clientes_valid.duplicated(subset=['Telemovel', 'Idade', 'Gender'])

# Sort the candidate duplicates by phone number and name
same_client_diff_id = clientes_valid[clientes_valid['Telemovel_clean'].isin(clientes_valid[duplicates]['Telemovel_clean'].unique())].sort_values(['Telemovel', 'Nome'])

# Shift the ID by 1 within each duplicate group
same_client_diff_id['id_shifted'] = same_client_diff_id.groupby(['Telemovel_clean', 'Idade', 'Gender'])['ID_Cli'].shift(1)

# Dataframe pairing each ID with its shifted ID
shifted_ids = same_client_diff_id[['ID_Cli', 'id_shifted']].dropna()

ids_to_remove = [x for x in id_dup if x not in shifted_ids['id_shifted'].unique()] + list(shifted_ids['ID_Cli'].unique())
ids_to_remove = list(set(ids_to_remove))

# Remove the duplicate IDs from clientes
clientes_valid = clientes_valid[~clientes_valid['ID_Cli'].isin(ids_to_remove)]

# Remove clients who asked not to be contacted
off_notes = ['não quer ser incomodada', 'não quer ser contactado']  # "does not want to be disturbed / contacted"
clientes_valid = clientes_valid[~clientes_valid['Notas'].isin(off_notes)]
```
### Clean clients dataset
```python
clientes_ = create_gender(clientes, woman, man)
clientes_ = create_birthday(clientes_, 'DataNascimento')

# Remove the duplicate IDs that belong to the same person
clientes_ = clientes_[~clientes_['ID_Cli'].isin(shifted_ids['ID_Cli'].unique())]

clientes_.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Index: 1068 entries, 0 to 1078
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   ID_Cli          1068 non-null   int64
 1   Nome            1062 non-null   object
 2   DataNascimento  775 non-null    datetime64[ns]
 3   Localidade      177 non-null    object
 4   Nacionalidade   330 non-null    object
 5   Telefone        36 non-null     object
 6   Telemovel       1008 non-null   object
 7   Email           561 non-null    object
 8   Notas           96 non-null     object
 9   Saldo           1068 non-null   int64
 10  Gender          1068 non-null   object
 11  First name      1068 non-null   object
 12  Idade           1068 non-null   int32
 13  range_age       1068 non-null   object
dtypes: datetime64[ns](1), int32(1), int64(2), object(10)
memory usage: 121.0+ KB
```
```python
# Aggregations for the plots
df_bar = clientes_.groupby('range_age')['ID_Cli'].count().reset_index()
df_pie = clientes_.groupby('Gender')['ID_Cli'].count().reset_index()
df_bar_M = clientes_[clientes_['Gender'] == 'Male'].groupby('range_age')['ID_Cli'].count().reset_index()
df_bar_F = clientes_[clientes_['Gender'] == 'Female'].groupby('range_age')['ID_Cli'].count().reset_index()

df_bar['color'] = 'rgb(102,194,165)'
df_bar_F['color'] = 'rgb(102,194,165)'
df_bar_M['color'] = 'rgb(252,141,98)'
```
```python
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "bar"}, {"type": "pie"}],
           [{"colspan": 2}, None]],
    column_widths=[0.7, 0.3], vertical_spacing=0.09, horizontal_spacing=0.02,
    subplot_titles=("Range age", "Gender", "Age by gender")
)
fig.add_trace(go.Bar(x=df_bar['range_age'], y=df_bar['ID_Cli'], marker=dict(color=df_bar['color']),
                     name='Age'), row=1, col=1)
fig.add_trace(go.Pie(labels=df_pie['Gender'], values=df_pie['ID_Cli'],
                     marker=dict(colors=['rgb(102,194,165)', 'rgb(252,141,98)', 'rgb(141,160,203)']),
                     hole=0.5, hoverinfo='label+percent+value', textinfo='label'),
              row=1, col=2)
fig.add_trace(go.Bar(x=df_bar_F['range_age'], y=df_bar_F['ID_Cli'],
                     marker=dict(color=df_bar_F['color']), name='Female'),
              row=2, col=1)
fig.add_trace(go.Bar(x=df_bar_M['range_age'], y=df_bar_M['ID_Cli'],
                     marker=dict(color=df_bar_M['color']), name='Male'),
              row=2, col=1)
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)
fig.update_xaxes(tickmode='array', row=1, col=1)
fig.update_xaxes(tickmode='array', row=2, col=1)
fig.update_yaxes(visible=False, row=2, col=1, categoryorder='total ascending')
fig.update_layout(height=500, bargap=0.2,
                  margin=dict(b=0.04, r=20, l=20), xaxis=dict(tickmode='linear'),
                  title_text="Analyzing Clients by Gender and Age Range",
                  template="plotly_dark",
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  title_x=0.5,
                  hoverlabel=dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=False)
fig.show()
```
```python
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Plot and embed in the notebook
iplot(fig, show_link=True)
```
```python
# Clients with no name
clientes_[clientes_['Gender'] == 'Other']
```
| ID_Cli | Nome | DataNascimento | Localidade | Nacionalidade | Telefone | Telemovel | Notas | Saldo | Gender | First name | Idade | range_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 rows × 13 columns
```python
produtos.describe(include='all')
```
| ID_Prd | Produto | Categoria | Valor | Qtd | |
|---|---|---|---|---|---|
| count | 105.000000 | 105 | 105 | 105.000000 | 105.0 |
| unique | NaN | 105 | 17 | NaN | NaN |
| top | NaN | Cabelos | Depliação Cera | NaN | NaN |
| freq | NaN | 1 | 15 | NaN | NaN |
| mean | 53.000000 | NaN | NaN | 77.961905 | 1.0 |
| std | 30.454885 | NaN | NaN | 122.758982 | 0.0 |
| min | 1.000000 | NaN | NaN | 0.000000 | 1.0 |
| 25% | 27.000000 | NaN | NaN | 20.000000 | 1.0 |
| 50% | 53.000000 | NaN | NaN | 35.000000 | 1.0 |
| 75% | 79.000000 | NaN | NaN | 60.000000 | 1.0 |
| max | 105.000000 | NaN | NaN | 675.000000 | 1.0 |
```python
produtos.head()
```
| ID_Prd | Produto | Categoria | Valor | Qtd | |
|---|---|---|---|---|---|
| 0 | 1 | Cabelos | Cabelo | 0 | 1 |
| 1 | 2 | Produtos do cabelo | Cabelo | 0 | 1 |
| 2 | 3 | manicure | Manicure | 8 | 1 |
| 3 | 4 | gel | Manicure | 25 | 1 |
| 4 | 5 | gel com extensoes | Manicure | 28 | 1 |
```python
# Produto with the highest value
produtos[produtos['Valor'] == produtos['Valor'].max()]
```
| ID_Prd | Produto | Categoria | Valor | Qtd | |
|---|---|---|---|---|---|
| 91 | 92 | Pack 5 sessoes cav+rf+presso | Pack Estética Corporal | 675 | 1 |
```python
# Produtos with the lowest value
produtos[produtos['Valor'] == produtos['Valor'].min()]
```
| ID_Prd | Produto | Categoria | Valor | Qtd | |
|---|---|---|---|---|---|
| 0 | 1 | Cabelos | Cabelo | 0 | 1 |
| 1 | 2 | Produtos do cabelo | Cabelo | 0 | 1 |
| 98 | 99 | outros | Outros | 0 | 1 |
| 100 | 101 | Prestacao de Servicos | Outros | 0 | 1 |
| 101 | 102 | pagamento | Outros | 0 | 1 |
| 102 | 103 | Reiki | Reiki | 0 | 1 |
| 103 | 104 | Microblanding | Depilação Linha | 0 | 1 |
| 104 | 105 | Tratamento capilar | Tratamento Capilar | 0 | 1 |
```python
# Produtos in the categoria "Outros"
produtos_outros = produtos[produtos['Categoria'] == 'Outros'].groupby('Produto').agg({'Valor': 'sum', 'Qtd': 'count'}).reset_index()
produtos_outros
```
| Produto | Valor | Qtd | |
|---|---|---|---|
| 0 | Prestacao de Servicos | 0 | 1 |
| 1 | outros | 0 | 1 |
| 2 | pagamento | 0 | 1 |
```python
# Filter by the categoria "Cabelo" to see which produtos belong to this category
produtos[produtos['Categoria'] == 'Cabelo'].groupby('Produto').agg({'Valor': 'sum', 'Qtd': 'count'}).reset_index()
```
| Produto | Valor | Qtd | |
|---|---|---|---|
| 0 | Cabelos | 0 | 1 |
| 1 | Produtos do cabelo | 0 | 1 |
```python
fig = px.bar(produtos.groupby(['Categoria', 'Produto'])['Valor'].sum().reset_index(),
             x='Categoria', y='Valor', color='Produto')
fig.update_yaxes(showgrid=False, ticksuffix=' ')
fig.update_layout(barmode='stack', xaxis={'categoryorder': 'total descending'},
                  title_text="Price per categoria by produto",
                  template="plotly_dark",
                  title_font=dict(size=25, color='#8a8d93', family="Lato, sans-serif"),
                  font=dict(color='#8a8d93'),
                  title_x=0.5)
fig.show()
```
```python
empregados.head()
```
| ID_Emp | Empregados | |
|---|---|---|
| 0 | 1 | Aurelio Mesquita |
| 1 | 2 | Carla Sebastiao |
| 2 | 3 | Ana |
| 3 | 4 | Patricia |
| 4 | 5 | João |
```python
empregados.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ID_Emp      29 non-null     int64
 1   Empregados  29 non-null     object
dtypes: int64(1), object(1)
memory usage: 592.0+ bytes
```
```python
tratamentos.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11018 entries, 0 to 11017
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   ID_Tra      11018 non-null  int64
 1   Obs         7141 non-null   object
 2   ID_Cli      11011 non-null  float64
 3   Data        11016 non-null  datetime64[ns]
 4   ID_Prod     11017 non-null  float64
 5   ID_Emp      11018 non-null  int64
 6   ID_MP       11018 non-null  int64
 7   Apagar      11013 non-null  float64
 8   ValorPago   11018 non-null  float64
 9   DataPg      11014 non-null  datetime64[ns]
 10  Factura     11018 non-null  bool
 11  ID_Sala     0 non-null      float64
 12  HoraInicio  0 non-null      float64
 13  HoraFim     0 non-null      float64
dtypes: bool(1), datetime64[ns](2), float64(7), int64(3), object(1)
memory usage: 1.1+ MB
```
```python
# The last row seems to be the total sum of ValorPago
tratamentos
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7914 | 2 de 10 | 710.0 | 2020-05-25 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-25 | False | NaN | NaN | NaN |
| 1 | 7915 | 5 de 10 | 728.0 | 2020-05-25 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-25 | False | NaN | NaN | NaN |
| 2 | 7916 | NaN | 302.0 | 2020-05-25 | 67.0 | 1 | 13 | 60.0 | 60.0 | 2020-05-25 | False | NaN | NaN | NaN |
| 3 | 7917 | NaN | 194.0 | 2020-05-26 | 17.0 | 4 | 13 | 17.5 | 17.5 | 2020-05-26 | False | NaN | NaN | NaN |
| 4 | 7918 | NaN | 194.0 | 2020-05-26 | 17.0 | 4 | 3 | 20.0 | 20.0 | 2020-05-26 | False | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11013 | 11186 | NaN | 1070.0 | 2023-05-23 | 82.0 | 1 | 3 | 25.0 | 25.0 | 2023-05-23 | False | NaN | NaN | NaN |
| 11014 | 11187 | bandagem abd | 366.0 | 2023-05-23 | 99.0 | 2 | 3 | 15.0 | 15.0 | 2023-05-23 | False | NaN | NaN | NaN |
| 11015 | 11188 | 6 de 10 | 366.0 | 2023-05-23 | 82.0 | 2 | 7 | 0.0 | 0.0 | 2023-05-23 | False | NaN | NaN | NaN |
| 11016 | 11189 | +ax+buço | 1091.0 | 2023-05-23 | 34.0 | 2 | 3 | 31.0 | 31.0 | 2023-05-23 | False | NaN | NaN | NaN |
| 11017 | 11190 | +ax Mariana | 1091.0 | 2023-05-23 | 34.0 | 2 | 3 | 28.0 | 28.0 | 2023-05-23 | False | NaN | NaN | NaN |
11018 rows × 14 columns
```python
# Drop the last row (it holds the total, not a treatment) and copy to avoid chained-assignment warnings
tratamentos = tratamentos.iloc[:-1, :].copy()

# Check the total of ValorPago without the summary row
tratamentos['ValorPago'].sum()
```

```
265680.34
```

```python
# Change the type of ValorPago to float
tratamentos.loc[:, 'ValorPago'] = tratamentos['ValorPago'].astype(float)
```
```python
tratamentos.describe(include='all')
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11017.000000 | 7140 | 11010.000000 | 11015 | 11016.000000 | 11017.000000 | 11017.000000 | 11012.000000 | 11017.000000 | 11013 | 11017 | 0.0 | 0.0 | 0.0 |
| unique | NaN | 2481 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN |
| top | NaN | brush | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN |
| freq | NaN | 654 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10704 | NaN | NaN | NaN |
| mean | 5616.668875 | NaN | 333.270572 | 2019-12-03 13:12:45.102133504 | 38.912672 | 5.614323 | 5.627666 | 24.391027 | 24.115489 | 2019-12-03 08:54:00.098065920 | NaN | NaN | NaN | NaN |
| min | 1.000000 | NaN | 1.000000 | 2018-01-29 00:00:00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2018-01-29 00:00:00 | NaN | NaN | NaN | NaN |
| 25% | 2814.000000 | NaN | 76.000000 | 2018-10-22 00:00:00 | 1.000000 | 2.000000 | 3.000000 | 8.000000 | 8.000000 | 2018-10-22 00:00:00 | NaN | NaN | NaN | NaN |
| 50% | 5632.000000 | NaN | 238.000000 | 2019-06-28 00:00:00 | 16.000000 | 3.000000 | 3.000000 | 15.000000 | 15.000000 | 2019-06-28 00:00:00 | NaN | NaN | NaN | NaN |
| 75% | 8419.000000 | NaN | 504.000000 | 2020-10-22 00:00:00 | 69.000000 | 7.000000 | 7.000000 | 30.000000 | 30.000000 | 2020-10-22 00:00:00 | NaN | NaN | NaN | NaN |
| max | 11189.000000 | NaN | 1111.000000 | 2023-05-23 00:00:00 | 105.000000 | 29.000000 | 13.000000 | 1500.000000 | 660.000000 | 2023-05-23 00:00:00 | NaN | NaN | NaN | NaN |
| std | 3231.899329 | NaN | 293.704714 | NaN | 38.494232 | 6.972885 | 3.978983 | 43.156380 | 37.861834 | NaN | NaN | NaN | NaN | NaN |
```python
# Most common observations
tratamentos.groupby('Obs')['ID_Tra'].count().reset_index().sort_values('ID_Tra', ascending=False).head(15)
```
| Obs | ID_Tra | |
|---|---|---|
| 993 | brush | 654 |
| 1249 | corte | 351 |
| 894 | br | 98 |
| 2316 | tinta | 87 |
| 1030 | brush e tinta | 80 |
| 1664 | gelinho | 79 |
| 357 | 2 de 3 | 74 |
| 255 | 1 de 3 | 73 |
| 1015 | brush e corte | 73 |
| 2329 | tinta e brush | 70 |
| 416 | 3 de 3 | 69 |
| 2401 | v.gel | 57 |
| 344 | 2 de 10 | 49 |
| 482 | 4 de 5 | 47 |
| 400 | 3 de 10 | 46 |
```python
produtos[produtos['ID_Prd'] == 1]
```
| ID_Prd | Produto | Categoria | Valor | Qtd | |
|---|---|---|---|---|---|
| 0 | 1 | Cabelos | Cabelo | 0 | 1 |
```python
# The observations make it possible to enrich the information for the categoria "Cabelo"
tratamentos[tratamentos['ID_Prod'] == 1].groupby('Obs')['ID_Tra'].count().reset_index().sort_values('ID_Tra', ascending=False).head()
```
| Obs | ID_Tra | |
|---|---|---|
| 166 | brush | 654 |
| 343 | corte | 350 |
| 70 | br | 98 |
| 618 | tinta | 87 |
| 203 | brush e tinta | 80 |
```python
# Treatments with a null client ID
tratamentos[tratamentos['ID_Cli'].isnull()]
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 256 | 8935 | +buco | NaN | 2021-06-09 | 20.0 | 2 | 3 | 10.0 | 10.0 | 2021-06-09 | False | NaN | NaN | NaN |
| 1333 | 99 | NaN | NaN | NaT | 24.0 | 4 | 3 | 19.0 | 19.0 | NaT | False | NaN | NaN | NaN |
| 4147 | 3503 | NaN | NaN | 2018-12-17 | 67.0 | 1 | 3 | 56.0 | 56.0 | 2018-12-17 | False | NaN | NaN | NaN |
| 9162 | 9218 | madeixas | NaN | 2021-08-26 | 1.0 | 24 | 3 | 60.0 | 60.0 | 2021-08-26 | False | NaN | NaN | NaN |
| 9924 | 9926 | NaN | NaN | 2022-04-07 | NaN | 0 | 0 | NaN | 0.0 | NaT | False | NaN | NaN | NaN |
| 9925 | 9927 | NaN | NaN | 2022-04-07 | 1.0 | 0 | 0 | 0.0 | 0.0 | NaT | False | NaN | NaN | NaN |
| 10296 | 10301 | a linha | NaN | 2022-07-07 | 14.0 | 28 | 3 | 5.0 | 5.0 | 2022-07-07 | False | NaN | NaN | NaN |
```python
# Treatments with a null product ID
tratamentos[tratamentos['ID_Prod'].isnull()]
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9924 | 9926 | NaN | NaN | 2022-04-07 | NaN | 0 | 0 | NaN | 0.0 | NaT | False | NaN | NaN | NaN |
```python
# Difference between ValorPago and Apagar
tratamentos.loc[:, 'diff'] = tratamentos['ValorPago'] - tratamentos['Apagar']
```
```python
# A negative difference means the client still owed money
tratamentos[tratamentos['diff'] < 0]
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 81 | 8866 | sobrancelha | 808.0 | 2021-05-19 | 14.0 | 25 | 3 | 6.0 | 5.0 | 2021-05-19 | False | NaN | NaN | NaN | -1.0 |
| 107 | 7968 | feito pela Carla tinha credito de 5€ da sessão... | 745.0 | 2020-06-15 | 80.0 | 15 | 3 | 15.0 | 10.0 | 2020-06-15 | False | NaN | NaN | NaN | -5.0 |
| 145 | 8006 | NaN | 750.0 | 2020-06-26 | 67.0 | 1 | 12 | 70.0 | 0.0 | 2020-06-26 | False | NaN | NaN | NaN | -70.0 |
| 165 | 8027 | NaN | 749.0 | 2020-07-03 | 69.0 | 2 | 12 | 50.0 | 0.0 | 2020-07-03 | False | NaN | NaN | NaN | -50.0 |
| 190 | 8052 | 94,400kg | 69.0 | 2020-07-08 | 69.0 | 1 | 12 | 40.0 | 0.0 | 2020-07-08 | False | NaN | NaN | NaN | -40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10829 | 11002 | NaN | 8.0 | 2023-03-11 | 69.0 | 1 | 0 | 60.0 | 0.0 | 2023-03-11 | False | NaN | NaN | NaN | -60.0 |
| 10854 | 11027 | +pedicure | 843.0 | 2023-03-23 | 8.0 | 26 | 12 | 25.0 | 0.0 | 2023-03-23 | False | NaN | NaN | NaN | -25.0 |
| 10857 | 11030 | tinha credito de 5€ | 1103.0 | 2023-03-24 | 69.0 | 2 | 3 | 45.0 | 40.0 | 2023-03-24 | False | NaN | NaN | NaN | -5.0 |
| 10883 | 11056 | NaN | 44.0 | 2023-03-30 | 67.0 | 1 | 12 | 60.0 | 0.0 | 2023-03-30 | False | NaN | NaN | NaN | -60.0 |
| 11012 | 11185 | 1 de 10 | 69.0 | 2023-05-22 | 69.0 | 1 | 10 | 500.0 | 0.0 | 2023-05-22 | False | NaN | NaN | NaN | -500.0 |
444 rows × 15 columns
```python
# People with a positive balance
tratamentos[tratamentos['diff'] > 0].tail()
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10741 | 10912 | unhas catarina | 237.0 | 2023-02-03 | 102.0 | 26 | 13 | 0.0 | 15.0 | 2023-02-03 | False | NaN | NaN | NaN | 15.0 |
| 10762 | 10933 | 2 de 10 | 69.0 | 2023-02-11 | 69.0 | 1 | 3 | 0.0 | 500.0 | 2023-02-11 | False | NaN | NaN | NaN | 500.0 |
| 10763 | 10934 | 2 de 10 turbinada | 366.0 | 2023-02-13 | 99.0 | 2 | 13 | 0.0 | 50.0 | 2023-02-13 | False | NaN | NaN | NaN | 50.0 |
| 10863 | 11036 | NaN | 843.0 | 2023-03-28 | 102.0 | 26 | 13 | 0.0 | 25.0 | 2023-03-28 | False | NaN | NaN | NaN | 25.0 |
| 10884 | 11057 | NaN | 44.0 | 2023-03-31 | 102.0 | 1 | 13 | 0.0 | 60.0 | 2023-03-31 | False | NaN | NaN | NaN | 60.0 |
The last two filters show that a negative difference usually means the client was still in debt for that treatment, while a positive difference appears when the client paid more than the price of the service; in some rows there is no obvious explanation for the mismatch. Since neither `Apagar` nor `ValorPago` alone reflects the true price of the service, a new column with the actual service value is needed. The new variable is created as follows:
```python
tratamentos.loc[:, 'ValorTotal'] = np.where(
    tratamentos['diff'] <= 0, tratamentos['ValorPago'] + np.abs(tratamentos['diff']),
    np.where(tratamentos['Apagar'] == 0, tratamentos['ValorPago'],
             tratamentos['ValorPago'] - tratamentos['diff'])
)
```
```python
# Treatments where ValorTotal equals 0
tratamentos[tratamentos['ValorTotal'] == 0]
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | ValorTotal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7914 | 2 de 10 | 710.0 | 2020-05-25 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-25 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1 | 7915 | 5 de 10 | 728.0 | 2020-05-25 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-25 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 11 | 7925 | 3 de 10 | 710.0 | 2020-05-27 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-27 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 16 | 7930 | filha mafalda | 23.0 | 2020-05-30 | 37.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-30 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 18 | 7932 | 6 de 6 | 517.0 | 2020-06-01 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-06-01 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11004 | 11177 | dra Vera 16 de 18 | 237.0 | 2023-05-19 | 69.0 | 1 | 7 | 0.0 | 0.0 | 2023-05-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 11005 | 11178 | 4 de 12 | 302.0 | 2023-05-19 | 67.0 | 1 | 7 | 0.0 | 0.0 | 2023-05-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 11009 | 11182 | + meso 4 de 10 | 52.0 | 2023-05-04 | 80.0 | 2 | 7 | 0.0 | 0.0 | 2023-05-22 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 11010 | 11183 | + pantallas 5 de 10 | 52.0 | 2023-05-22 | 80.0 | 2 | 7 | 0.0 | 0.0 | 2023-05-22 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 11015 | 11188 | 6 de 10 | 366.0 | 2023-05-23 | 82.0 | 2 | 7 | 0.0 | 0.0 | 2023-05-23 | False | NaN | NaN | NaN | 0.0 | 0.0 |
1400 rows × 16 columns
```python
# A zero ValorTotal can occur when a product is divided into sessions and the client pays for all sessions at the start of the treatment
tratamentos[tratamentos['ID_Cli'] == 1094].sort_values('Data')
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | ValorTotal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 394 | 10688 | 1 de 3 | 1094.0 | 2022-11-18 | 80.0 | 2 | 13 | 35.0 | 35.0 | 2022-11-18 | False | NaN | NaN | NaN | 0.0 | 35.0 |
| 417 | 10711 | 2 de 3 | 1094.0 | 2022-11-25 | 80.0 | 2 | 7 | 0.0 | 0.0 | 2022-11-25 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10575 | 10744 | 3 de 3 | 1094.0 | 2022-12-06 | 80.0 | 2 | 7 | 0.0 | 0.0 | 2022-12-06 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10576 | 10745 | dlm 1 de 5 | 1094.0 | 2022-12-06 | 99.0 | 2 | 13 | 75.0 | 75.0 | 2022-12-06 | False | NaN | NaN | NaN | 0.0 | 75.0 |
| 10597 | 10766 | dlm 2 de 5 | 1094.0 | 2022-12-15 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2022-12-15 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10622 | 10791 | dlm 3 de 5 | 1094.0 | 2022-12-22 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2022-12-22 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10721 | 10892 | dlm 4 de 5 | 1094.0 | 2023-01-27 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2023-01-27 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10744 | 10915 | dlm 5 de 5 | 1094.0 | 2023-02-03 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2023-02-03 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10767 | 10939 | dlm 1 de 5 | 1094.0 | 2023-02-13 | 99.0 | 2 | 13 | 75.0 | 75.0 | 2023-02-13 | False | NaN | NaN | NaN | 0.0 | 75.0 |
| 10881 | 11054 | dlm 2 de 5 | 1094.0 | 2023-03-31 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2023-03-31 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10917 | 11090 | 3 de 5 | 1094.0 | 2023-04-15 | 80.0 | 2 | 7 | 0.0 | 0.0 | 2023-04-15 | False | NaN | NaN | NaN | 0.0 | 0.0 |
```python
import re

# Treatments with ValorTotal zero whose observations don't contain a number, 'oferta', or 'nan'
tratamentos.loc[:, 'Obs'] = tratamentos['Obs'].astype(str)
tratamentos[(~tratamentos['Obs'].str.contains(r'\d')) & (~tratamentos['Obs'].str.contains('oferta')) & (tratamentos['ValorTotal'] == 0) & (tratamentos['Obs'] != 'nan')]
```
| ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | ValorTotal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 7930 | filha mafalda | 23.0 | 2020-05-30 | 37.0 | 2 | 7 | 0.0 | 0.0 | 2020-05-30 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 114 | 7975 | já terminou | 341.0 | 2020-06-17 | 68.0 | 2 | 7 | 0.0 | 0.0 | 2020-06-17 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1283 | 49 | dr teresa | 43.0 | 2018-02-15 | 99.0 | 1 | 9 | 0.0 | 0.0 | 2018-02-15 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1315 | 81 | retirar gelinho | 63.0 | 2018-02-19 | 7.0 | 6 | 9 | 0.0 | 0.0 | 2018-02-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1400 | 166 | aparar barba | 95.0 | 2018-02-24 | 99.0 | 3 | 9 | 0.0 | 0.0 | 2018-02-24 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1653 | 385 | autorizado pelo Aurelio Mesquita | 176.0 | 2018-03-19 | 1.0 | 3 | 9 | 0.0 | 0.0 | 2018-03-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 2936 | 1584 | tinta e corte | 216.0 | 2018-07-07 | 1.0 | 3 | 12 | 0.0 | 0.0 | 2018-07-07 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 4091 | 3447 | marido | 85.0 | 2018-12-14 | 68.0 | 2 | 6 | 0.0 | 0.0 | 2018-12-14 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 4464 | 3878 | a cliente não gostou da franja, por isso devol... | 488.0 | 2019-01-19 | 1.0 | 3 | 9 | 0.0 | 0.0 | 2019-01-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 6918 | 6482 | criolipolise | 314.0 | 2019-09-07 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2019-09-07 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 7310 | 6877 | amiga Esmeralda | 201.0 | 2019-10-14 | 67.0 | 1 | 9 | 0.0 | 0.0 | 2019-10-14 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 7739 | 7800 | voucher cliente Anita | 201.0 | 2020-03-09 | 67.0 | 1 | 7 | 0.0 | 0.0 | 2020-03-09 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 7742 | 7803 | br | 604.0 | 2020-03-09 | 1.0 | 19 | 9 | 0.0 | 0.0 | 2020-03-09 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 7751 | 7812 | brush | 309.0 | 2020-03-11 | 1.0 | 19 | 9 | 0.0 | 0.0 | 2020-03-11 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 8256 | 9728 | mae do Celso tinha um credito | 69.0 | 2022-02-17 | 69.0 | 1 | 6 | 0.0 | 0.0 | 2022-02-17 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 8526 | 8569 | usou a hora toda | 52.0 | 2020-12-21 | 69.0 | 2 | 7 | 0.0 | 0.0 | 2020-12-21 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 8544 | 8587 | mesa radionica | 52.0 | 2020-12-30 | 99.0 | 2 | 7 | 0.0 | 0.0 | 2020-12-30 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 8939 | 8721 | avo usou | 52.0 | 2021-03-20 | 68.0 | 2 | 7 | 0.0 | 0.0 | 2021-03-20 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 9317 | 8688 | domicilio | 237.0 | 2021-03-04 | 101.0 | 2 | 7 | 0.0 | 0.0 | 2021-03-04 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 9716 | 9612 | lpg aniversario | 366.0 | 2022-01-13 | 99.0 | 2 | 9 | 0.0 | 0.0 | 2022-01-13 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 10347 | 10353 | Dra Vera | 237.0 | 2022-07-22 | 69.0 | 1 | 7 | 0.0 | 0.0 | 2022-07-22 | False | NaN | NaN | NaN | 0.0 | 0.0 |
# Check if client 488 returned after the treatment they didn't like
tratamentos[tratamentos['ID_Cli']==488].sort_values('Data')
| | ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | ValorTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4232 | 3589 | brush | 488.0 | 2018-12-22 | 1.0 | 3 | 3 | 10.0 | 10.0 | 2018-12-22 | False | NaN | NaN | NaN | 0.0 | 10.0 |
| 4233 | 3590 | shampoo+creme | 488.0 | 2018-12-22 | 1.0 | 3 | 3 | 4.0 | 4.0 | 2018-12-22 | False | NaN | NaN | NaN | 0.0 | 4.0 |
| 4255 | 3613 | 30 m | 488.0 | 2018-12-26 | 103.0 | 2 | 13 | 20.0 | 20.0 | 2018-12-26 | False | NaN | NaN | NaN | 0.0 | 20.0 |
| 4464 | 3878 | a cliente não gostou da franja, por isso devol... | 488.0 | 2019-01-19 | 1.0 | 3 | 9 | 0.0 | 0.0 | 2019-01-19 | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 5917 | 5412 | brus | 488.0 | 2019-06-08 | 1.0 | 3 | 3 | 10.0 | 10.0 | 2019-06-08 | False | NaN | NaN | NaN | 0.0 | 10.0 |
| 7903 | 7437 | brush | 488.0 | 2019-12-12 | 1.0 | 19 | 13 | 10.0 | 10.0 | 2019-12-12 | False | NaN | NaN | NaN | 0.0 | 10.0 |
| 7996 | 7530 | nan | 488.0 | 2019-12-24 | 1.0 | 5 | 13 | 50.0 | 50.0 | 2019-12-24 | False | NaN | NaN | NaN | 0.0 | 50.0 |
# Treatments with no date
tratamentos[tratamentos['Data'].isnull()]
| | ID_Tra | Obs | ID_Cli | Data | ID_Prod | ID_Emp | ID_MP | Apagar | ValorPago | DataPg | Factura | ID_Sala | HoraInicio | HoraFim | diff | ValorTotal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69 | 8854 | nan | 805.0 | NaT | 0.0 | 0 | 0 | 0.0 | 0.0 | NaT | False | NaN | NaN | NaN | 0.0 | 0.0 |
| 1333 | 99 | nan | NaN | NaT | 24.0 | 4 | 3 | 19.0 | 19.0 | NaT | False | NaN | NaN | NaN | 0.0 | 19.0 |
# Compare if Data and DataPg are the same
comparison = tratamentos['Data'] == tratamentos['DataPg']
comparison[comparison==False].head()
69      False
227     False
1333    False
1354    False
1635    False
dtype: bool
# Drop columns that are not useful
tratamentos.drop(['Factura', 'ID_Sala', 'HoraInicio', 'HoraFim', 'diff', 'Apagar', 'ValorPago', 'DataPg'], axis=1, inplace=True)
# change the ids for the same client with different ids
tratamentos_merge = pd.merge(tratamentos, shifted_ids, on='ID_Cli', how='left')
tratamentos.loc[:, 'ID_Cli'] = np.where(tratamentos_merge['id_shifted'].isna(), tratamentos_merge['ID_Cli'], tratamentos_merge['id_shifted'])
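The two lines above replace duplicate client ids with their canonical id wherever `shifted_ids` provides a mapping, and keep the original id otherwise. A toy run of the same pattern (the mapping frame here is fabricated for illustration, not the notebook's actual `shifted_ids`):

```python
import numpy as np
import pandas as pd

tratamentos = pd.DataFrame({'ID_Cli': [1.0, 2.0, 3.0]})
# hypothetical mapping: client 2 is a duplicate record of client 7
shifted_ids = pd.DataFrame({'ID_Cli': [2.0], 'id_shifted': [7.0]})

merged = pd.merge(tratamentos, shifted_ids, on='ID_Cli', how='left')
# keep the original id where no shifted id exists, substitute it otherwise
tratamentos.loc[:, 'ID_Cli'] = np.where(
    merged['id_shifted'].isna(), merged['ID_Cli'], merged['id_shifted'])
print(tratamentos['ID_Cli'].tolist())  # [1.0, 7.0, 3.0]
```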
df_final = tratamentos.merge(produtos.drop(['Qtd'], axis=1), left_on='ID_Prod', right_on='ID_Prd', how='left')
df_final.drop(['ID_Prod'], axis=1, inplace=True)
df_final = df_final.merge(clientes_[['ID_Cli', 'Gender', 'Idade', 'range_age']], on='ID_Cli', how='left')
df_final = df_final.merge(empregados, on='ID_Emp', how='left')
df_final.head()
| | ID_Tra | Obs | ID_Cli | Data | ID_Emp | ID_MP | ValorTotal | ID_Prd | Produto | Categoria | Valor | Gender | Idade | range_age | Empregados |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7914 | 2 de 10 | 710.0 | 2020-05-25 | 2 | 7 | 0.0 | 99.0 | outros | Outros | 0.0 | Female | 41.0 | [35-54] | Carla Sebastiao |
| 1 | 7915 | 5 de 10 | 728.0 | 2020-05-25 | 2 | 7 | 0.0 | 99.0 | outros | Outros | 0.0 | Female | 47.0 | [35-54] | Carla Sebastiao |
| 2 | 7916 | nan | 302.0 | 2020-05-25 | 1 | 13 | 60.0 | 67.0 | Mss Pedras quentes | Massagem | 70.0 | Male | -1.0 | No age | Aurelio Mesquita |
| 3 | 7917 | nan | 194.0 | 2020-05-26 | 4 | 13 | 17.5 | 17.0 | dep. Cera Feminino perna inteira | Depliação Cera | 25.0 | Female | 30.0 | [0-34] | Patricia |
| 4 | 7918 | nan | 194.0 | 2020-05-26 | 4 | 3 | 20.0 | 17.0 | dep. Cera Feminino perna inteira | Depliação Cera | 25.0 | Female | 30.0 | [0-34] | Patricia |
# Null ID_Cli
df_final[df_final['ID_Cli'].isnull()]
| | ID_Tra | Obs | ID_Cli | Data | ID_Emp | ID_MP | ValorTotal | ID_Prd | Produto | Categoria | Valor | Gender | Idade | range_age | Empregados |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 256 | 8935 | +buco | NaN | 2021-06-09 | 2 | 3 | 10.0 | 20.0 | dep. Cera Feminino axilas | Depliação Cera | 10.0 | NaN | NaN | NaN | Carla Sebastiao |
| 1333 | 99 | nan | NaN | NaT | 4 | 3 | 19.0 | 24.0 | dep Cera Masculina axilas | Depliação Cera | 19.0 | NaN | NaN | NaN | Patricia |
| 4147 | 3503 | nan | NaN | 2018-12-17 | 1 | 3 | 56.0 | 67.0 | Mss Pedras quentes | Massagem | 70.0 | NaN | NaN | NaN | Aurelio Mesquita |
| 9162 | 9218 | madeixas | NaN | 2021-08-26 | 24 | 3 | 60.0 | 1.0 | Cabelos | Cabelo | 0.0 | NaN | NaN | NaN | Lucinda |
| 9924 | 9926 | nan | NaN | 2022-04-07 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9925 | 9927 | nan | NaN | 2022-04-07 | 0 | 0 | 0.0 | 1.0 | Cabelos | Cabelo | 0.0 | NaN | NaN | NaN | NaN |
| 10296 | 10301 | a linha | NaN | 2022-07-07 | 28 | 3 | 5.0 | 14.0 | dep. Cera Feminino buco | Depliação Cera | 7.0 | NaN | NaN | NaN | Bruna Machado |
# Create a new column with the products for hair
df_final = produto_cabelo(df_final)
# Create features month and month name
df_final['Month'] = df_final['Data'].dt.month
df_final['Month_name'] = df_final['Data'].dt.month_name()
df_final = df_final.sort_values('Month')
# Create feature season
bins = [0, 3, 6, 9, 12]
labels = ['Winter', 'Spring', 'Summer', 'Fall']
df_final['season'] = pd.cut(df_final['Data'].dt.month, bins=bins, labels=labels)
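Because `pd.cut` uses right-closed intervals by default, the bins above map months 1-3 to Winter, 4-6 to Spring, 7-9 to Summer, and 10-12 to Fall. A quick standalone sanity check of that binning:

```python
import pandas as pd

# months on the band edges and inside the bands
months = pd.Series([1, 3, 4, 8, 12])
bins = [0, 3, 6, 9, 12]
labels = ['Winter', 'Spring', 'Summer', 'Fall']
seasons = pd.cut(months, bins=bins, labels=labels)
print(seasons.tolist())  # ['Winter', 'Winter', 'Spring', 'Summer', 'Fall']
```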
plot_trents(df_final, ['Empregados', 'Categoria'], 'ID_Tra', 'ValorTotal')
plot_trents(df_final, ['Gender', 'Categoria', 'Produto'], 'ID_Tra', 'ValorTotal')
plot_trents(df_final, ['season', 'Categoria', 'Gender'], 'ID_Tra', 'ValorTotal')
plot_trents(df_final, ['Month_name', 'Categoria', 'Gender'], 'ID_Tra', 'ValorTotal')
plot_trents(df_final, ['Categoria', 'Gender'], 'ID_Tra', 'ValorTotal')
# Number of clients by date
df_final['year'] = df_final['Data'].dt.year
dis = df_final.groupby(['Data'])['ID_Cli'].nunique().reset_index()
fig = px.line(dis, x='Data', y='ID_Cli')
fig.update_layout(template="plotly_dark")
fig.show()
# ValorTotal by date
dis = df_final.groupby(['Data'])['ValorTotal'].sum().reset_index()
fig = px.line(dis, x='Data', y='ValorTotal')
fig.update_layout(template="plotly_dark")
fig.show()
# Drop gender 'Other'
df_final = df_final[df_final['Gender'] != 'Other']
rfm = create_rfm_columns(df_final)
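`create_rfm_columns` is one of the notebook's helper functions. A minimal, self-contained stand-in for what it computes might look like this (using the most recent date in the data as the snapshot and 30-day months are my assumptions, not necessarily the notebook's):

```python
import pandas as pd

def rfm_sketch(df):
    # Recency in (30-day) months since the client's last visit,
    # Frequency as the number of treatments, Monetary as total spend
    snapshot = df['Data'].max()
    return df.groupby('ID_Cli').agg(
        Recency_months=('Data', lambda d: (snapshot - d.max()).days / 30),
        Frequency=('ID_Tra', 'count'),
        Monetary=('ValorTotal', 'sum'))

# tiny fabricated example with the same column names
demo = pd.DataFrame({
    'ID_Cli': [1, 1, 2],
    'ID_Tra': [10, 11, 12],
    'Data': pd.to_datetime(['2023-01-01', '2023-01-31', '2022-12-02']),
    'ValorTotal': [10.0, 20.0, 5.0]})
print(rfm_sketch(demo))
```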
# split the recency into 5 different categories
conditions = [
    (rfm['Recency_months'] <= 3),
    (rfm['Recency_months'] > 3) & (rfm['Recency_months'] <= 6),
    (rfm['Recency_months'] > 6) & (rfm['Recency_months'] <= 12),
    (rfm['Recency_months'] > 12) & (rfm['Recency_months'] <= 24),
]
choices = [5, 4, 3, 2]
rfm['R_val'] = np.select(conditions, choices, default=1)
# split the monetary value into 5 different categories
conditions = [
    (rfm['Monetary'] > 2000),
    (rfm['Monetary'] > 1000) & (rfm['Monetary'] <= 2000),
    (rfm['Monetary'] > 250) & (rfm['Monetary'] <= 1000),
    (rfm['Monetary'] > 50) & (rfm['Monetary'] <= 250),
]
choices = [5, 4, 3, 2]
rfm['M_val'] = np.select(conditions, choices, default=1)
# split the frequency into 5 different categories
conditions = [
    (rfm['Frequency'] > 35),
    (rfm['Frequency'] > 15) & (rfm['Frequency'] <= 35),
    (rfm['Frequency'] > 5) & (rfm['Frequency'] <= 15),
    (rfm['Frequency'] > 1) & (rfm['Frequency'] <= 5),
]
choices = [5, 4, 3, 2]
rfm['F_val'] = np.select(conditions, choices, default=1)
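The three near-identical `np.select` blocks could also be collapsed into one helper. A minimal sketch (the `band_score` name and the `searchsorted` trick are mine, not from the notebook; the edges reproduce the thresholds above):

```python
import numpy as np
import pandas as pd

def band_score(values, edges, ascending=True):
    """Score values 1-5 by threshold band. `edges` are the four cut
    points; ascending=True gives larger values higher scores (Frequency,
    Monetary), ascending=False gives smaller values higher scores
    (Recency)."""
    # side='left' makes each band right-closed, matching the
    # (low, high] conditions used above
    scores = np.searchsorted(edges, values, side='left') + 1
    return scores if ascending else 6 - scores

rec = pd.Series([2, 5, 30])
print(band_score(rec, [3, 6, 12, 24], ascending=False))  # [5 4 1]
freq = pd.Series([40, 20, 3, 1])
print(band_score(freq, [1, 5, 15, 35]))                  # [5 4 2 1]
```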
rfm["RFM_Score"] = rfm["R_val"].astype(str) + rfm["F_val"].astype(str) + rfm["M_val"].astype(str)
# Segments
seg_map = {
r'[1-2][1-2]': 'Hibernating',
r'[1-2][3-4]': 'At Risk',
r'[1-2]5': 'Can\'t Loose',
r'3[1-2]': 'About to Sleep',
r'33': 'Need Attention',
r'[3-4][4-5]': 'Loyal Customers',
r'41': 'Promising',
r'51': 'New Customers',
r'[4-5][2-3]': 'Potential Loyalists',
r'5[4-5]': 'Champions'
}
# Add segments to each client (mapped from the R and F scores)
rfm['Segment'] = rfm['R_val'].astype(str) + rfm['F_val'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(seg_map, regex=True)
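Since `replace` is called with `regex=True`, each two-character R+F string is matched against the patterns in `seg_map`. A standalone check of that mechanism on a small subset of the map:

```python
import pandas as pd

# two of the patterns from seg_map above
subset = {r'[1-2][1-2]': 'Hibernating', r'5[4-5]': 'Champions'}
rf = pd.Series(['11', '22', '55'])
print(rf.replace(subset, regex=True).tolist())
# ['Hibernating', 'Hibernating', 'Champions']
```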
Summary per segment
rfm.reset_index().groupby(['Segment']).agg(
    {'ID_Cli': 'count', 'Recency_months': 'mean',
     'Frequency': 'mean', 'Monetary': 'mean'}).round(2).sort_values('Monetary')
| Segment | ID_Cli | Recency_months | Frequency | Monetary |
|---|---|---|---|---|
| Promising | 7 | 4.22 | 1.00 | 32.79 |
| About to Sleep | 69 | 9.71 | 1.96 | 46.11 |
| Hibernating | 617 | 40.60 | 2.17 | 48.38 |
| New Customers | 2 | 1.45 | 1.00 | 60.00 |
| Potential Loyalists | 34 | 2.84 | 6.26 | 157.97 |
| Need Attention | 10 | 9.04 | 9.30 | 231.12 |
| At Risk | 256 | 41.56 | 13.50 | 287.87 |
| Can't Loose | 48 | 39.69 | 68.71 | 1394.76 |
| Loyal Customers | 22 | 7.07 | 42.18 | 1442.19 |
| Champions | 25 | 0.85 | 97.16 | 3945.71 |
Join the RFM analysis with the data
rfm.reset_index(inplace= True)
df_frm = df_final.merge(rfm, on='ID_Cli', how='left')
df_frm.head()
| | ID_Tra | Obs | ID_Cli | Data | ID_Emp | ID_MP | ValorTotal | ID_Prd | Produto | Categoria | Valor | Gender | Idade | range_age | Empregados | Month | Month_name | season | year | Frequency | Monetary | Recency_months | R_val | M_val | F_val | RFM_Score | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4259 | corte e tinta | 472.0 | 2019-01-23 | 3 | 3 | 30.5 | 1.0 | Corte | Cabelo | 0.0 | Female | -1.0 | No age | Ana | 1.0 | January | Winter | 2019.0 | 11.0 | 185.5 | 48.166667 | 1.0 | 2.0 | 3.0 | 132 | At Risk |
| 1 | 11622 | 3 de 3 | 1043.0 | 2023-01-31 | 26 | 7 | 0.0 | 81.0 | Radiofrequencia | Estética Corporal | 65.0 | Male | -1.0 | No age | Iza Marques | 1.0 | January | Winter | 2023.0 | 12.0 | 140.0 | 3.866667 | 4.0 | 2.0 | 3.0 | 432 | Potential Loyalists |
| 2 | 11623 | nan | 704.0 | 2023-01-30 | 27 | 3 | 13.0 | 8.0 | verniz gel | Manicure | 13.0 | Female | -1.0 | No age | Beatriz Fernandes | 1.0 | January | Winter | 2023.0 | 64.0 | 1511.0 | 2.233333 | 5.0 | 4.0 | 5.0 | 554 | Champions |
| 3 | 11624 | nan | 843.0 | 2023-01-23 | 26 | 13 | 15.0 | 8.0 | verniz gel | Manicure | 13.0 | Female | -1.0 | No age | Iza Marques | 1.0 | January | Winter | 2023.0 | 21.0 | 423.0 | 0.966667 | 5.0 | 3.0 | 4.0 | 543 | Champions |
| 4 | 10390 | 10 sessões | 728.0 | 2022-01-31 | 2 | 13 | 115.0 | 102.0 | pagamento | Outros | 0.0 | Female | 47.0 | [35-54] | Carla Sebastiao | 1.0 | January | Winter | 2022.0 | 90.0 | 2300.0 | 0.333333 | 5.0 | 5.0 | 5.0 | 555 | Champions |
# Clients without an ID don't have a segment
df_frm[df_frm['Segment'].isna()]
| | ID_Tra | Obs | ID_Cli | Data | ID_Emp | ID_MP | ValorTotal | ID_Prd | Produto | Categoria | Valor | Gender | Idade | range_age | Empregados | Month | Month_name | season | year | Frequency | Monetary | Recency_months | R_val | M_val | F_val | RFM_Score | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3127 | 10642 | nan | NaN | 2022-04-07 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | April | Spring | 2022.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3129 | 10643 | nan | NaN | 2022-04-07 | 0 | 0 | 0.0 | 1.0 | Cabelos | Cabelo | 0.0 | NaN | NaN | NaN | NaN | 4.0 | April | Spring | 2022.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4679 | 9622 | +buco | NaN | 2021-06-09 | 2 | 3 | 10.0 | 20.0 | dep. Cera Feminino axilas | Depliação Cera | 10.0 | NaN | NaN | NaN | Carla Sebastiao | 6.0 | June | Spring | 2021.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6185 | 11022 | a linha | NaN | 2022-07-07 | 28 | 3 | 5.0 | 14.0 | dep. Cera Feminino buco | Depliação Cera | 7.0 | NaN | NaN | NaN | Bruna Machado | 7.0 | July | Summer | 2022.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7633 | 9915 | madeixas | NaN | 2021-08-26 | 24 | 3 | 60.0 | 1.0 | Color | Cabelo | 0.0 | NaN | NaN | NaN | Lucinda | 8.0 | August | Summer | 2021.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11145 | 3810 | nan | NaN | 2018-12-17 | 1 | 3 | 56.0 | 67.0 | Mss Pedras quentes | Massagem | 70.0 | NaN | NaN | NaN | Aurelio Mesquita | 12.0 | December | Fall | 2018.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11906 | 108 | nan | NaN | NaT | 4 | 3 | 19.0 | 24.0 | dep Cera Masculina axilas | Depliação Cera | 19.0 | NaN | NaN | NaN | Patricia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# remove clients without ID
df_frm = df_frm[~df_frm['Segment'].isna()]
for seg in df_frm['Segment'].unique():
    df = df_frm[df_frm['Segment'] == seg]
    segments_categoria_gender(df, seg)
We want to observe trends among our clients by segment. For example, we are interested in identifying the additional treatments typically chosen by male clients in the Champions segment who received massage treatments between May and June.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Champions'], 'Male', 'Massagem', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Champions'], 'Female', 'Massagem', ['May', 'June'], contain_category=True)
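`filter_Category_month` is defined earlier in the notebook; the idea behind it can be sketched as follows (the function and its column handling are my simplification, not the actual helper):

```python
import pandas as pd

def co_occurring_categories(df, gender, category, months):
    # clients of `gender` who booked `category` in `months` ...
    sel = df[(df['Gender'] == gender) &
             (df['Categoria'] == category) &
             (df['Month_name'].isin(months))]
    clients = sel['ID_Cli'].unique()
    # ... and which other categories those same clients also booked
    others = df[df['ID_Cli'].isin(clients) & (df['Categoria'] != category)]
    return others['Categoria'].value_counts()

# fabricated mini-example: client 1 had a massage in May and a manicure in June
demo = pd.DataFrame({
    'ID_Cli': [1, 1, 2],
    'Gender': ['Male', 'Male', 'Female'],
    'Categoria': ['Massagem', 'Manicure', 'Massagem'],
    'Month_name': ['May', 'June', 'May']})
print(co_occurring_categories(demo, 'Male', 'Massagem', ['May', 'June']))
```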
In the graphs above, we can observe that men in the "Champions" segment, who have received massage treatments, typically also opt for IPL or pedicures. Moreover, the most common type of massage among them is hot stones. Women who have received massage treatments often also opt for body aesthetics and hair treatments. The most common type of massage among women is a 1-hour relaxing massage.
According to the graphs, we can create two offers: for men, a combination of a hot stone massage and a pedicure; for women, a 1-hour relaxing massage combined with a body contouring treatment.
In this category, we only have one man, so we are going to create an offer exclusively for women.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == "Can't Loose"], 'Female', 'Cabelo', ['May', 'June'], contain_category=True)
For this category, the offer can be a brush and a basic manicure for women.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'At Risk'], 'Male', 'Manicure', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'At Risk'], 'Female', 'Manicure', ['May', 'June'], contain_category=True)
Between May and June, there are no treatments for men, so we exclude them from this analysis. The offer for women includes a brush and a basic manicure.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Loyal Customers'], 'Male', 'Massagem', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Loyal Customers'], 'Female', 'Massagem', ['May', 'June'], contain_category=True)
The offer for men would be hot stone massages along with a pedicure, while for women, the offer would be hot stone massages along with IPL (Intense Pulsed Light) treatment for half legs.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Hibernating'], 'Male', 'Massagem', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Hibernating'], 'Female', 'Cabelo', ['May', 'June'], contain_category=True)
In this category, men who have previously received massages have only opted for massages. For them, the offer would be a 1-hour relaxing massage combined with a pedicure, as has been common in the other segments.
For women in this category, the offer would include a brush and a haircut.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Potential Loyalists'], 'Male', 'Massagem', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Potential Loyalists'], 'Female', 'Manicure', ['May', 'June'], contain_category=True)
The potential loyalist males prefer a 1-hour relaxing massage. Therefore, for them, the offer includes the massage along with a pedicure. For women, the offer consists of a manicure with gel polish and a pedicure.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Potential Loyalists'], 'Male', 'Estética Corporal', ['May', 'June'], contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Potential Loyalists'], 'Female', 'Manicure', ['May', 'June'], contain_category=True)
In this category, men prefer radiofrequency treatments but also enjoy massages. Therefore, the offer for men in this category would include a 30-minute relaxing massage along with radiofrequency treatment. For women, the offer consists of a manicure with gel polish and a pedicure.
year = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
        'August', 'September', 'October', 'November', 'December']
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Promising'], 'Male', 'Depilação Laser', year, contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Promising'], 'Female', 'Estética Corporal', year, contain_category=True)
Since there are no people in this category for the months of May and June, I will consider the entire year. Additionally, there is only one man in this category. For him, the offer can be laser hair removal for the beard area along with a pedicure. For women, the offer can be Lipolaser treatment along with a pedicure.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'New Customers'], 'Male', 'Estética Corporal', year, contain_category=True)
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'New Customers'], 'Female', 'Estética Corporal', year, contain_category=True)
Since the new customers have only tried cryolipolysis treatments, the offer can combine cryolipolysis with a relaxing massage, the spa's specialty, for both genders.
df_g, df_e = filter_Category_month(df_frm[df_frm['Segment'] == 'Need Attention'], 'Female', 'Manicure', year, contain_category=True)
In this category, there are only women, and the offer for them consists of a manicure and pedicure with gel polish.